Introduction

Extreme computational power of GPU

The main reason to use a GPU is the computational power it offers. There are two major advantages of a GPU over a CPU:

  1. Computational Throughput
  2. Extremely High Memory Bandwidth

GPUs are designed for compute-intensive, highly parallel computation.
Therefore more transistors are devoted to data processing rather than to data caching and advanced control logic.

Difference between CPU and GPU

CPUs are designed to minimize latency, so the majority of their silicon area is dedicated to caches and control logic. This is why CPUs are very good at running serial programs.

GPUs, on the other hand, are designed to maximize throughput, so the majority of their silicon area is dedicated to the arithmetic units that do the actual data processing.

In order to utilize the GPU, the parts of the program that can be parallelized must be decomposed into a large number of threads that can run concurrently. In CUDA, these threads are defined by special functions called kernels (functions that run on the device, i.e. the GPU).
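As a minimal sketch of what a kernel looks like (the name vecAdd and the vector-addition task are illustrative choices, not anything prescribed by CUDA):

```cuda
// Kernel: each thread computes one element of c = a + b.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    // Global index of this thread (thread indexing is discussed in detail later).
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: the last block may contain extra threads
        c[i] = a[i] + b[i];
}
```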

Executing a kernel is called launching the kernel. When we launch a kernel, it is executed as a set of threads; each thread runs on a CUDA core, and threads are scheduled for execution in groups of 32 called warps.
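For example, a launch of the illustrative vecAdd kernel above might look like this (the block size of 256 and the device pointers d_a, d_b, d_c are assumptions of the sketch):

```cuda
int n = 1 << 20;                                          // number of elements
int threadsPerBlock = 256;                                // illustrative block size
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock; // enough blocks to cover n

vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);    // d_* point to device memory
cudaDeviceSynchronize();                                  // wait for the kernel to finish
```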

The two main actors are the CPU and the GPU. We use the following terminology:

1. Host: the CPU and its memory
2. Device: the GPU and its dedicated DRAM
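As a small illustration of this terminology, host memory is allocated with ordinary C calls while device memory is allocated through the CUDA runtime (the size and the h_/d_ names below are arbitrary):

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

size_t bytes = 1024 * sizeof(float);

float *h_data = (float *)malloc(bytes);   // host: CPU memory
float *d_data = NULL;
cudaMalloc((void **)&d_data, bytes);      // device: GPU DRAM
```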

Programs often contain both serial and parallel parts. We run the serial code on the CPU and the parallel code (the kernels) on the GPU. Using the CPU and GPU together in this way is often termed heterogeneous parallel programming.

This is where CUDA comes in: CUDA is a heterogeneous parallel programming language designed specifically for NVIDIA GPUs.

CUDA is essentially C with a set of extensions. In the CUDA programming model, the host is in control of the program. The host and device communicate over the PCI bus, which is slow relative to the host and device themselves, so exchanging data between the CPU and GPU is expensive. Therefore, only the massively parallel portions of the code are executed on the device.
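Putting these pieces together, a typical host-controlled program follows a copy-in, compute, copy-out pattern. The sketch below reuses the illustrative vecAdd kernel from earlier; the names and sizes are arbitrary:

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

int main(void)
{
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host (CPU) side: allocate and fill the inputs.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Device (GPU) side: allocate memory in device DRAM.
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);

    // Copy the inputs from host to device.
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch the massively parallel part on the device.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Copy the result back from device to host.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    // Cleanup.
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```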

Kernels get executed as sets of parallel threads. CUDA is designed to execute thousands of threads concurrently (tens of thousands of threads can be resident on the device at once).

CUDA threads execute in a SIMD (Single Instruction, Multiple Data) fashion; more precisely, NVIDIA GPUs use the SIMT (Single Instruction, Multiple Threads) execution model.
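One practical consequence of SIMT is warp divergence: if threads within the same group of 32 take different branches, the branches are executed one after the other for that group. The toy kernel below (purely illustrative) contains such a divergent branch:

```cuda
// Even and odd threads within a warp take different branches, so under SIMT
// the two branches are executed serially for that warp.
__global__ void divergent(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        out[i] = 1.0f;    // even-numbered threads
    else
        out[i] = -1.0f;   // odd-numbered threads
}
```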

Note that the threads do not all finish at the same time: since each acts on different data, their execution times can differ slightly.

In order to organise threads on the cores, CUDA arranges them in a hierarchy with three levels.

In short, Threads \(\in\) Blocks \(\in\) Grid.
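Inside a kernel, each thread can discover its position in this hierarchy through the built-in variables threadIdx, blockIdx, blockDim and gridDim. A tiny sketch (the kernel name whereAmI is just a placeholder):

```cuda
#include <cstdio>

// Each thread prints where it sits in the grid/block/thread hierarchy.
__global__ void whereAmI(void)
{
    printf("block (%d,%d), thread (%d,%d)\n",
           blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y);
}
```

Launched as, say, `whereAmI<<<dim3(2, 2), dim3(4, 4)>>>();`, it prints one line per thread: 2 x 2 blocks of 4 x 4 threads, i.e. 64 lines.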

Grids can be 1D, 2D, or 3D.

Imagine a 1D grid with 4 blocks (4 x 1). Each block can itself be a 1D, 2D, or 3D arrangement of threads.

Suppose the blocks are 2D with dimensions (4 x 5), i.e., 4 threads in x and 5 threads in y, so the threads of a block form a (5 x 4) matrix. In total we have 20 threads per block, and with 4 such blocks we have 80 threads.

A detailed discussion can be found in Indexing Threads within Grids and Blocks.

Upon launching such a kernel with this grid, a total of 80 threads will be executed concurrently.
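As a sketch of this configuration in code, the grid and block shapes above map directly onto dim3 values (myKernel is just a placeholder name):

```cuda
dim3 grid(4);       // 1D grid: 4 blocks (4 x 1)
dim3 block(4, 5);   // 2D blocks: 4 threads in x, 5 in y -> 20 threads per block

// 4 blocks x 20 threads per block = 80 threads in total.
myKernel<<<grid, block>>>(/* kernel arguments */);
```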